Genome signatures, self-organizing maps and higher order phylogenies: a parametric analysis.

Genome signatures are data vectors derived from the compositional statistics of DNA. The self-organizing map (SOM) is a neural network method for the conceptualisation of relationships within complex data, such as genome signatures. The various parameters of the SOM training phase are investigated for their effect on the accuracy of the resulting output map. It is concluded that larger SOMs, as well as taking longer to train, are less sensitive in phylogenetic classification of unknown DNA sequences. However, where a classification can be made, a larger SOM is more accurate. Increasing the number of iterations in the training phase of the SOM only slightly increases accuracy, without improving sensitivity. The optimal length of the DNA sequence k-mer from which the genome signature should be derived is 4 or 5, but shorter values are almost as effective. In general, these results indicate that small, rapidly trained SOMs are generally as good as larger, longer trained ones for the analysis of genome signatures. These results may also be more generally applicable to the use of SOMs for other complex data sets, such as microarray data.


Introduction
Molecular evolutionary methodology revolves around the production of sequence alignments and trees. However, as evolutionary distance increases between two homologous molecules, their similarity may decay to the point where they are no longer alignable. Construction of a phylogenetic tree under such circumstances becomes impossible. One method that has been suggested for the study of distant evolutionary relationships is that of genomic signatures or genome signatures † (Karlin and Ladunga, 1994;Karlin and Burge, 1995;Karlin and Mrazek, 1996). At least one reviewer has come to the conclusion that it is the preferred method in cases where evolutionary distance, recombination, horizontal transmission or variable mutation rates may confound traditional alignment-based techniques (Brocchieri, 2001).
The fi rst derivation of genome signatures predates the invention of DNA sequencing. Biochemical studies revealed that the frequencies of nearest-neighbour dinucleotide pairs in DNA were generally consistent within genomes, and often different between genomes. These characteristic nearest neighbour patterns were termed general schemes (Russell et al. 1976;Russell and Subak-Sharpe, 1977), and constitute, in modern terminology, a subset of genome signatures, those of length k = 2.
As long DNA sequences began to be isolated and computers entered the biological laboratory, it became a simple matter to produce nearest-neighbour frequency tables. Indeed, for any DNA sequence of length N, it is theoretically possible to derive frequency tables for all k-mers ranging from 1 to N, within that sequence. The frequency table at k = 1 corresponds to the raw nucleotide content on one strand. On the assumption that DNA is double stranded under most circumstances in most species, the complementary bases are also scored. This reduces the raw count of the four bases to a single value, between zero and one, representing the GC content of that DNA sequence. Correspondingly, at k = 2, the raw count of 16 dinucleotide frequencies, can be reduced to a vector containing 10 values if the count for each dimer on the top strand is added to the count for its complement on the other strand. There are 10 values, not 8, in this vector since GC, CG, AT and TA are self-complementary. This process † Both genome signature and genomic signature are used interchangeably in the fi eld, including by their originators. However, the term genome signature is to be preferred, since genomic signature is used in the fi eld of molecular diagnostics to refer to a genotype correlated with medical symptoms or prognosis (e.g. Russo et al. 2005) is called symmetrization (Karlin and Ladunga, 1994). The symmetrized values in the vector are then usually corrected for the frequencies of their component monomers, as follows: where f XY is the symmetrized frequency of dinucleotide XY, and f X and f Y are the symmetrized frequencies of bases X and Y, respectively. The whole vector is referred to as the genome signature at k = 2 or, particularly in the extensive literature of the Karlin group, simply as ρ * XY . For all values of k, the nomenclature GS-k is here adopted.
The vector thus becomes an array of the ratios of observed frequencies of k-mers to their expected frequencies given an underlying zero-order Markov chain model of a DNA sequence. Even though symmetrization will reduce the size of the vector for large values of k, it is apparent that it will still grow in size at the order of 4 k for an alphabet of length 4. In practice, most investigators have confi ned themselves to the study of genome signatures of k = 2, in other words to ρ * XY , symmetrized dinucleotide frequencies corresponding to general schemes, although in recent years the availability of faster computers has undoubtedly contributed to the increasing use of genome signatures up to k = 10 ( Deschavanne et al. 1999;Edwards et al. 2002;Abe et al. 2003a;Sandberg et al. 2003;Campanaro et al. 2005;Dufraigne et al. 2005;Wang et al. 2005;Paz et al. 2006).
The length of DNA required to generate a genome signature has conventionally been taken to be around 50 kb, and for this value it has been observed that the Hamming or Euclidean distances between signatures derived from contigs within species are generally considerably smaller than the corresponding average values between species (Karlin and Ladunga, 1994;Karlin and Burge, 1995;Abe et al. 2002;Teeling et al. 2004), even when the same-species contigs are on different chromosomes (Gentles and Karlin, 2001). However, recent work has established that genome signatures within species may be stable over lengths as short as 10 kb (Deschavanne et al. 1999;Karlin, 2001;Abe et al. 2002) or less (Sandberg et al. 2001;Jernigan and Baran, 2002;Abe et al. 2003a;Sandberg et al. 2003;McHardy et al. 2007). This has led to their practical applica-tion in the detection of pathenogenicity islands (pIs) in pathogenic bacteria. These are sequences originating in horizontal transmission from one bacterium to another, converting a previously innocuous strain into a pathogenic one. Their foreign origin is often reflected in a genome signature closer to their species of origin than their current host genome (Karlin, 1998;Karlin, 2001;Dufraigne et al. 2005).
Phylogenetic conclusions drawn from comparison of genome signatures have sometimes been controversial. For instance,  found that cyanobacteria do not form a coherent evolutionary group, and that Methanococcus jannaschii is closer to eukaryotes than to other proteobacteria, and Campbell et al. (1999) suggested that archaea do not form a coherent clade. Karlin (1998) posited a wide variety of further revisions of the prokaryotic phylogeny based on genome signature results, as well as a novel origin for mitochondria . Edwards et al. (2002) used genome signatures as part of a revision of the phylogeny of birds. Nevertheless, few authors have felt confi dent enough to draw phylogenetic trees based on genome signature comparisons. Coenye and Vandamme (2004) have shown that dinucleotide content is only a reliable indicator of relatedness for closely related organisms. To visualize genome signature relationships between species, a variety of other representational schemes have been used including histograms , partial ordering graphs , chaos games (Deschavanne et al. 1999;Edwards et al. 2002;Wang et al. 2005), and self-organizing maps (Abe et al. 2003b).
This paper uses self-organizing maps (SOMs) as a tool to explore genome signature variability at phylogenetic levels from superkingdom down to genus. The SOM is a neural network method which spreads multi-dimensional data onto a twodimensional surface (Kohonen, 1997). Its endpoint is therefore similar to multi-dimensional scaling or principal components analysis, and like these other techniques has been extensively used in biology, principally for the analysis of microarray data but also to a lesser extent for sequence analysis (Arrigo et al. 1991;Giuliano et al. 1993;Andrade et al. 1997;Tamayo et al. 1999;Kanaya et al. 2001;Wang et al. 2001;Abe et al. 2002;Covell et al. 2003;Ressom et al. 2003;Xiao et al. 2003;Mahony et al. 2004;Oja et al. 2005;Abe et al. 2006;Samsonova et al. 2006). The resulting "fl at" representation may be a strong aid to intuitive understanding of the structure of complex multidimensional datasets. The SOM is not a clustering technique per se, but the surface may be divided up into zones that are then treated as clusters. Alternatively, cluster boundaries on the surface may be defi ned more objectively using additional algorithms (Ultsch, 1993). The SOM is also not hierarchical (unlike UPGMA but like K-means clustering, two other commonly used techniques for the analysis of microarrays). This absence of hierarchy means that it is particularly suited to situations where the natural hierarchy of species relationships, refl ecting evolutionary descent, may have been violated, e.g. by horizontal gene transfer.
In this paper, the main parameters of the SOM: its size and the number of iterations used in its construction, are investigated for their effects on its classifi catory accuracy. These parameters must be chosen at the beginning of each run of SOM building, and there is little guidance in the SOM literature as to their optimal values. As well as the parameters of the SOM, the value of k used in the genome signature is similarly examined. High k genome signatures are extremely long vectors that may present considerable memory problems even on modern computers. Likewise, lengthy iterations in training the SOM, especially if it is a large one, may consume considerable time.

Genome sequences
Complete genome sequences were downloaded from NCBI Taxonomy Browser (http://www.ncbi. nlm.nih.gov/Taxonomy/taxonomyhome.html/). A Perl script was written to divide complete genome sequences into consecutive strings of 10 or 100 kb, as required. Trailing ends, and genomes shorter than the required string length, were discarded. The resulting FASTA-formatted datasets were then processed to calculate their genome signatures. Table 1 lists the genomes used as the main data set for the paper, that of viruses of the family Herpesviridae. The analyses shown in Figures 3 to 7 use this set. A larger set of genomes with the widest possible phylogenetic range, including all three superkingdoms of cellular life as well as viruses, is given in Table 2. These are used for the "all-life" and superkingdom-level SOMs in Figure  1. Table 3 lists those viral genomes used for the SOM across a wide set of viral genomes, displayed in Figure 2.

Calculation of genome signatures
A Perl script was written to derive raw k-mer counts on FASTA-formatted databases of input sequences, using the SeqWords.pm module from BioPerl (http://www.bioperl.org/Pdoc-mirror/bioperl-live/ Bio/Tools/SeqWords.html). The raw k-mer frequencies were then symmetrized, as follows: where f ν and f ν-comp are the raw frequencies of a k-mer ν and its complement ν-comp. Symmetrization means that a sequence and its complement will generate the same answer. The symmetrized frequencies are then corrected for the 1-mer content. For instance for a 2-mer XY, where X and Y can each represent any nucleotide base {A, C, T, G}: where f sXY is the symmetrized frequency for dimer XY and f sX and f sY are the symmetrized frequencies of its component 1-mers. For a 3-mer XYZ, the correction would be for the 1-mers, X, Y and Z and so on.
The genome signature vector for length k, is thus composed of a series of ratios of observed to expected values of its component k-mers, where the expected values are determined by a zeroorder Markov chain (Bernouilli series) model. Genome signatures are therefore not distorted by gross base compositional differences between genomes, which would otherwise be the dominant factor.

Self-organizing map
Self-organizing maps (SOMs) were run following Tamayo et al. (1999), using a Perl script. Input consisted of an array of the genome signatures generated as described above. The dimensions of the SOM and the number of iterations in training were variables entered by the user. Euclidean distances were used when comparing vectors.  (Fauquet et al. 2005 Once the dimensions of the SOM were set, x columns by y rows, weight vectors initializing each of the xy cells of the SOM were selected at random from the entire set of genome signature data vectors. The SOM is thus initially simply fi lled with a random subset of the data. Training then commences, for nominated t iterations. At each iteration m, each data vector in turn was compared to each weight vector, and the closest weight vector for each data vector designated the winning weight vector of that data vector in that iteration. Each time a winning weight vector is identifi ed, the winning weight vector, and the weight vectors of cells within a spatial range R on the SOM, were then trained by the data vector as follows.
Each value c in the winning weight vector w is altered, so that its value at iteration, m, becomes at the next iteration m+1: where w c m -v c represents the difference between the winning weight vector and the data vector for each value c along the vectors. In other words, one simply aligns the data vector and the winning weight vector and subtracts them. Each value of the winning weight vector is then altered to bring it closer to the data vector by a factor of τ, the training effect, which is derived as follows: Table 2. Genomes used for the analysis shown in Figure 1. In total there are 79 eukaryotic, 156 eubacterial, 30 archaeal and 122 viral genomes with more than 100kb of sequence.

Name Superkingdom Accession
where m is the number of the current iteration, and t the number of total iterations requested. Therefore, the number of iterations of the SOM, a parameter chosen at the start of the process, determines the gradient at which α will decrease as the iterations progress. Whereas α is the same for all cells in the SOM and changes according to the iteration number only, γ is the Euclidean distance on the SOM from the weight vector being trained within range of the winning weight vector. τ can therefore be seen to decrease as the SOM progresses, since α decreases, and also to decrease the further one goes away from the winning weight vector, since γ increases.
The range within which weight vectors are trained at each iteration is calculated: where S is the length or breadth of the SOM, whichever is the smaller. The area of the SOM being trained therefore also shrinks as α decreases with increasing iterations.
Once each data vector has found its winning weight vector and trained it, also training the weight vectors within range ℜ of the winning weight vector, then one iteration is completed. New values of α,τ and ℜ are then calculated, and the second iteration can commence. It can be intuitively grasped that there is a great deal of "churn" in initial iterations of the SOM. When α is close to 1, data vectors will effectively change their winning weight vector to copies of themselves. Only at the limits of the trained area R will the effect be subtler. However, as the number of iterations mounts, α will decrease and each data vector will have a relatively weaker effect on its winning weight vector and even less on those weight vectors in its vicinity. Observation (data not shown) of distribution of a simple data set over a SOM through the iterative process shows that a relatively chaotic process dominates until approximately halfway through the nominated number of iterations, at which point structure rapidly builds in the SOM. The fi nal 10% or so of iterations consist mostly of fi ne-tuning of the fi nal weight vector values. Training SOMs can also be time consuming, especially for large data sets of high dimensionality vectors trained over large numbers of iterations. The longest run presented here (that in Fig. 2) took in excess of 3 weeks on a single 2.8 GHz Intel processor under a Linux operating system. One of the major motivations of this paper was to defi ne ways to reduce SOM training time without losing accuracy or sensitivity.
After the fi nal iteration, each data vector is again compared to each weight vector and assigned to the closest. This results in partition of each data vector to one cell in the SOM, thus spreading the multi-dimensional data across the two-dimensional surface of the SOM. Conversely, each fi nal weight vector in the SOM is assigned to its closest data vector, the centroid nearest neighbour (cnn). If the data vectors belong to several categories, each cell in the SOM can be colored according to the origin of its cnn, which is then said to dominate that cell in the SOM. This allows the production of colorcoded dominance maps indicating the general spread of the data vector set over the SOM. NCBI taxonomic categories were used throughout, except for herpesviruses where the International Committee on the Taxonomy of Viruses (ICTV) usage is followed (Davison 2002;Davison et al. 2005;Fauquet et al. 2005).

Availability of scripts
All Perl scripts, for processing genomes, calculating genome signatures, and running SOMs are available on request from the author (d.gatherer@mrcvu.gla. ac.uk).

SOMs on large sequence datasets
The ability of SOMs to distinguish the origin of fragments of DNA based on their genome signatures, was initially tested using GS-2 (see Methods, section 2, above) measured over fragments of 100 kb. At the time of analysis there were 79 eukaryotic, 156 eubacterial, 30 archaeal and 122 viral genomes with more than 100 kb of sequence each ( Table 2). The dimension of the SOM was 50 × 50 and 100 iterations were used.
At the end of the iterations, dominance areas (see Methods, section 3, above), were used to color the SOM. For the entire data set, "all life" in Figure 1, the superkingdoms of archaea, eubacteria and eukaryota were chosen, along with the unranked category of viruses. Within each of the SOMs applied to the superkingdoms and the viruses, the next level down was used for coloring dominance maps. This is the phylum level in the archaea and eubacteria, and the family level in the viruses. In the eukaryota, the relative scarcity of completely sequenced genomes required a more ad hoc classifi cation.
When all input sets are pooled, GS-2 produces a SOM in which eubacterial sequences cluster together ( Fig. 1; "All life", green). Archaeal sequences are split into several groups that are situated along the boundary between the eubacteria and the eukaryotes. Likewise, viral sequences are split into one group in the top left corner and other clusters along the eubacterial-eukaryotic border. It is evident that this "all life" SOM does not contribute to the issue of the phylogeny of the three superkingdoms, except to underline that archaea are not derivatives of either eukaryotes or eubacteria.
When the SOM is confined to archaeal sequences ( Fig. 1; "Archaea"), those genomes   designated "unclassified" by NCBI, are located well within the territory of the Euryarchaeota, strongly suggesting that they belong to this phylum. In general the archaeal inter-phylum boundaries are clear, although the Crenarchaeaota are split into two clusters. The predominance of Euryarchaeota in terms of area is a reflection of the larger number of complete genomes in that phylum. Likewise, in the eukaryotes ( Fig. 1; "Eukaryota"), the large size of the human genome contributes to a large area dominated by the Vertebrata. It should be remembered that the classifi cation in the eukaryotes is ad hoc owing to the relatively small number of complete genomes. However, it is interesting that the boundaries between the dominance areas are as distinct as those in the archaea.
The situation is considerably more complicated within the eubacteria ( Fig. 1; "Eubacteria"), being the superkingdom with the greatest number of completely sequenced genomes. Some eubacterial phyla are rather fragmented in their dominance areas. For instance, the phylum Firmicutes occupies several partly adjacent areas. The phylum Deinococcus has two small and rather distant dominance areas, and the Bacteroidetes and Spirochaetes both have small outlying fragments. The Proteobacteria dominate the right side of the SOM and penetrate between the various groups on the left side. The overall impression is of less clear-cut differences in GS-2 between phyla in eubacteria than in eukaryotes or archaea.
A similar situation is observed in the SOM on viral sequences ( Fig. 1; "Viruses"). A few viral families, such as the Baculoviridae, the family Mimivirus and the Nimaviridae do manage coherent dominance areas, but all others are extensively mixed. The Baculoviridae are the only family of any size than maintain a distinctive dominance area.
This basic illustration of the SOM in action demonstrates that for a single parameter set, namely 50 × 50 SOM and 100 iterations, different phylogenetic groups exhibit variable degrees of partition across the SOM.

Increased resolution SOM on viruses
To increase the resolution of the SOM against viral sequences, GS-2 was reapplied to viral sequences only using 10 kb fragments. This enables a larger number of viral genomes to be analysed, up from 122 to 579, as genomes of 10 kb or more can be included (Table 3). The number of iterations was increased to 1000. The resulting dominance map is shown in Figure 2.
When viral sequences alone are considered at higher resolution, the SOM becomes very complex. The family level classifi cation is maintained for the dominance map but there are now more families, since viruses as small as 10 kb are eligible. Perhaps the most salient feature is that Poxviridae are divisible into sheep/goat pox viruses and others (Fig. 2: "sheep/goat" and "other pox"). Additionally phages, within the family Caudovirales, tend to be differentially located on the SOM in four major areas, one of which,  Figure 5. The density of herpesviral sequences, classed by family, on a 10 × 10 SOM after 100 iterations. >95% density: red; 5%-95% density: yellow; <5% density: white. The fi gure in each box is the ratio of sequences in red to yellow areas of the SOM. mycophages, accounts for two of these areas ( Fig.  2: "myco-ϕ", "entero-ϕ" and "cocco-ϕ"). Again the Baculoviridae form a noticeably large and coherent cluster. Herpesviridae, by contrast, are spread across the entire map.
Herpesviridae (Table 1) are next considered alone under the same conditions as in Figure 2. Dominance maps for this narrower selection are shown in Figure 3. Figure 3 shows that when family-level taxonomy is considered within herpesviruses, GS-2 distinguishes the ostreid herpesviruses and the ictaluriviruses as two fairly homogenous blocs distinct from the Alloherpesviridae (Davison, 2002), comprising the alpha, beta and gamma families. At the genus-level, Muromegalovirus alone forms a nearly contiguous bloc although Mardivirus nearly does so. The remaining genera, like the families, are considerably mixed across the SOM. Like the wide spread of herpesvirus signatures across the viral SOM, this is a refl ection of the degree of sequence heterogeneity with the Herpesviridae.
The three fi gures presented above demonstrate that the SOM is an intriguing tool for the conceptualisation of relationships between genome signatures. However, the evident complexity of some of the topographical arrangements raises serious questions concerning its utility as a diagnostic tool for phylogeny. Therefore, some experiments are described which address this issue in a quantitative way.

Effect of length of k-mer used to generate genome signature
In order to investigate if genome signatures of longer k give better resolution than k = 2, 10 kb herpesvirus sequences were processed into genome signature of GS-2 to GS-6 and the SOM was trained for 100 iterations (Fig. 4). On fi rst inspection, it does not appear that a higher genome signature provides any better resolution than a lower one. The GS-3 SOM was also run on a 20 × 20 map, but again this produces no major change to the overall pattern. In all cases, ostreid herpesvirus and ictalurivirus have coherent dominance areas on the SOM. At GS-5, alpha herpesviruses also have a coherent dominance area, but this disappears again at GS-6. In order to further investigate this apparent lack of improvement at higher values of k, the density of sequences of each family was plotted onto the SOM (Fig. 5). Instead of the dominance map approach, in which each cell is colored according to the affi liation of its cnn ( Fig. 1-4 are all of this type), cells in which more than 95% of allocated sequences are of a single type are colored red, and those with fewer than 5% of that type are white. Cells between these two   The "undecided" column indicates the percentage of sequences in the test set that could not be assigned to a sub-family or genus. The "correct" column indicates the percentage of assignable sequences that were correctly assigned. Optimal values are highlighted in yellow.
extremes are colored yellow. A ratio is then produced of red-to-yellow in each SOM. A perfectly partitioned SOM will therefore have a ratio of infi nity, indicating no mixed cells, or more accurately no cells with greater than 5% mixture of the "wrong" family. Figure 5 demonstrates that family level taxonomy is better determined at higher GS in all fi ve families of herpesviruses. The ratio of high alpha-density (>95%, red) to medium alpha-density (5% to 95%, yellow) increases from 0.88 to 2.83 as the GS increases from 2 to 4. The corresponding increases for the beta and gamma families are from 0.52 to 2.33 and from 1.91 to 2.11 respectively. For the ostreid herpesviral sequences, perfect partition is reached at GS-4 and for the ictalurid viruses at GS-3. This is probably a refl ection of the presence of a single virus in each of these categories with a correspondingly lower number of sequences analysed.

Effect of length of training phase of SOM
It is therefore apparent that genome signature of longer values of k produce some improvement in the accuracy of the fi nal partition on the SOM. However, longer k results in longer data vectors, increasing at order 4 k and therefore much slower training of the SOM. One way to speed training of the SOM is simply to reduce the number of training cycles. The effect of the number of iterations on density of each family is displayed in Figure 6. Figure 6 shows that increasing the number of iterations has a mixed effect on the density of family sequences. The alpha herpesviral sequences increase in density from 0.92 to 1.35 as the number of iterations increases from 10 to 1000, and the beta herpesviruses from 0.52 to 0.83. The ostreid herpesviral sequences are also perfectly clustered at 100 iterations. However, the gamma and ictalurid sequences are more poorly partitioned at higher numbers of iterations.

Jack-knifi ng analysis
Figures 1-6 provide a largely qualitative impression of the effectiveness of SOMs in correctly assigning the origins of DNA sequences based on their genome signature. To provide a further more quantitative assessment of the parameters of the process, a jack-knifi ng analysis was carried out. All herpesviral sequences were divided randomly into two groups. Genome signatures and SOMs were constructed as appropriate using one half. Then the remaining half was applied to the SOM to predict their origin at the family and genus level. To make a prediction concerning the origin of a data vector, the Euclidean distances between that vector and all of the weight vectors of the preconstructed SOM, are calculated. The origin of the nearest weight vector is taken to be the classifi cation of the data vector being tested. Where a data vector falls into a cell on the SOM containing none of the original data vectors used to construct the SOM, its origin is deemed to be "undecided" (Fig. 7).
When SOM size is varied for GS-2 at 100 iterations (Fig. 7, top left table), SOMs of greater than 10 × 10 introduce considerably uncertainty into the assignment. However, for those sequences that can be assigned, 95% accuracy at the subfamily level is achieved in a 50 × 50 SOM. Likewise, a 30 × 30 SOM gives 94% accuracy at the genus level. When SOM size is held at 10 × 10 and the signature length at GS-2 and the number of iterations is varied (Fig. 7, lower left table), there is little effect on the sensitivity. At the subfamily level, there are never more than 4.4% of sequences that cannot be assigned, and never more than 7.2% at the genus level. Where sequences can be assigned, optimal accuracy is achieved at 1000 or 5000 iterations, but the variation in accuracy is low. Increasing the iterations from 10 to 5000 only gives a 4% increase in accuracy of assignment at the sub-family level. When 100 iterations are used and the SOM size is held at 10 × 10 (Fig. 7, top right table), GS-4 or GS-5 appear to be optimal.

Discussion
Genome signatures provide a summary of the k-mer content of a genome, corrected for compositional bias. Various studies in a wide range of species have revealed that genome signatures are generally constant within genomes and similar in related genomes (Karlin and Ladunga, 1994;Karlin et al. 1998;Gentles and Karlin, 2001). The extent to which this is a phenomenon of neutral drift or one of active conservation is unknown. It is intuitively obvious that two identical genomes will have identical genome signatures, and that as they diverge the genome signatures will also diverge. Indeed this is the basis of a least one bioinformatical tool that assesses sequence relatedness Li et al. 2002). However, various suggestions have been made for conservative selection pressures which would act to maintain genome signature similarity in related organisms, including dinucleotide stacking energies, curvature, methylation, superhelicity, context-dependent mutation biases and effects deriving from related replication machinery (Karlin and Burge, 1995;Blaisdell et al. 1996). If these factors are similar within a clade, they might act as a brake on genome signature divergence. The conservation of genome signatures within genomes (which is what originally gave rise to the term "signature" in this context) would tend to suggest that signatures do not drift neutrally, at least within genomes. Figure 1 demonstrates that at the phylum level within the three superkingdoms of cellular life, satisfactory partition of GS-2 can be obtained by the SOM. However, this is less true for eubacteria than it is for eukaryotes and archaea. At the family level in viruses the picture is considerably more confused, with only the Baculoviridae demonstrating anything like territorial coherence on the SOM at GS-2 ( Fig.  1 and 2). This may well be a refl ection of speed of substitution in viral genomes. However at the species level, the same coherence within genomes as found in cellular organisms may well be the norm. For instance, when the ostreid and ictalurid herpesvirus families are included in a SOM with the Alloherpesviridae, these two families, both represented by a single viral genome, have strongly discrete areas on the SOM (Fig. 3 and 4).
This does not mean that genome signatures are not diagnostic tools for phylogenetic assignment at the family and sub-family level in herpesviruses, merely that the results should be interpreted with caution. The use of higher values of k appears to have a marginal effect on improving the discrete distribution of family-level herpesviral signatures on the SOM (Fig. 5) but jack-knifi ng indicates that this does not improve above k = 5 (Fig. 7). The effects of larger dimension SOMs and increased iterations are ambiguous at best. Optimal values appear to be around GS-4 or GS-5 with 500 to 1000 iterations of the SOM. The size of the SOM might be varied, with an initial run at high dimension (e.g. 50 × 50) followed by a lower dimension run (e.g. 10 × 10) for sequences unassigned by the fi rst run (Fig. 7).
The use of genome signatures in the identifi cation of pathogenicity islands is by now well established (Karlin, 1998;Karlin, 2001;Dufraigne et al. 2005). They are valuable in this context in that they indicate regions within genomes that have characteristics different to the rest of the genome. However, it is apparent from the present work that it is diffi cult on the basis of genome signatures to accurately identify the origin of the exogenous DNA. A BLAST search is more likely to generate informative hits in this context. Nevertheless for sequences that cannot be precisely identifi ed on the basis of alignment-based methods such as BLAST, genome signatures with SOMs holds out the prospect of identifi cation of origin to a reasonable level.
The optimization of SOM parameters reported here may also extend to other applications of SOMs. Of particular interest in bioinformatics is their use for the analysis of microarray data. The experimental design would be the same, with a standard microarray data set (e.g. the breast cancer data provided by Reid et al. 2005) substituting for the genome signature arrays. Dominance mapping would be done by clinical outcome, and jack-knife analysis could test the accuracy and sensitivity of assignment of that outcome.